SmartZoning® Documentation

Leveraging Permitting and Zoning Data to Predict Upzoning Pressure in Philadelphia

Authors

Laura Frances and Nissim Lebovits

Published

December 13, 2023

This model and web application prototype were developed for MUSA508, a Master of Urban Spatial Analytics class focused on predictive public policy analytics at the University of Pennsylvania.

1 Background

Growth is critical for a city to continue to densify and modernize. Its benefits range from increased public transit use to a built environment updated for climate resilience. Growth fuels development and vice versa. Philadelphia is the 6th largest city in the US, yet ranks 42nd in cost of living, so growth is often met with concern. Many residents and preservationists ask: Will growth erode the city’s best features? Will modernization make the city unaffordable to longtime residents?

Balancing growth with affordability is a precarious task for Philadelphia. To date, politicians favor making exceptions for developers parcel by parcel rather than championing a citywide smart growth strategy. Zoning advocates need better data-driven tools to broadcast the benefits of smart growth, a planning framework that aims to maximize walkability and transit use while avoiding sprawl, and to demonstrate how parcel-by-parcel exceptions, or spot zoning, create unmet development pressure that can drive up costs. Designed to support smart growth advocacy, SmartZoning is a prototype web tool that identifies parcels under development pressure whose current zoning conflicts with that pressure. Users can leverage the tool to promote proactive upzoning of high-priority parcels, aligning current zoning more closely with anticipated development. This approach aims to foster affordable housing in Philadelphia, addressing one of the city’s most pressing challenges.

Smart Growth meets SmartZoning®

2 Overview

Below, we outline how we developed a predictive model that forecasts future annual development patterns with a low mean absolute error. This model is the basis of SmartZoning: by anticipating future development, it can highlight where current zoning may hinder smart growth. We also consider the relationship between development pressure and race, income, and housing cost burden, demonstrating the generalizability of our model across different socioeconomic contexts and investigating the impacts of development locally and citywide.

Show the code
required_packages <- c("tidyverse", "sf", "acs", "tidycensus", "sfdep", "kableExtra", "conflicted",
                       "gganimate", "tmap", "gifski", "transformr", "ggpubr", "randomForest", "janitor",
                       'igraph', "plotly", "ggcorrplot", "Kendall", "car", "shiny", "leaflet",
                       "classInt")
suppressWarnings(
install_and_load_packages(required_packages)
)

source("utils/viz_utils.R")



urls <- c(
  roads = 'https://opendata.arcgis.com/datasets/261eeb49dfd44ccb8a4b6a0af830fdc8_0.geojson', # for broad and market
  council_dists = "https://opendata.arcgis.com/datasets/9298c2f3fa3241fbb176ff1e84d33360_0.geojson",
  building_permits = building_permits_path,
  permits_bg = final_dataset_path,
  zoning = "https://opendata.arcgis.com/datasets/0bdb0b5f13774c03abf8dc2f1aa01693_0.geojson",
  opa = "data/opa_properties.geojson",
  ols_preds = 'data/model_outputs/ols_preds.geojson',
  rf_test_preds = 'data/model_outputs/rf_test_preds.geojson',
  rf_val_preds = 'data/model_outputs/rf_val_preds.geojson',
  rf_proj_preds = 'data/model_outputs/rf_proj_preds.geojson'
  
)

suppressMessages({
  invisible(
    imap(urls, ~ assign(.y, phl_spat_read(.x), envir = .GlobalEnv))
  )
})

broad_and_market <- roads %>% filter(ST_NAME %in% c('BROAD',"MARKET") | SEG_ID %in% c(440370, 421347,421338,421337,422413,423051,440403,440402,440391,440380))

council_dists <- council_dists %>%
                    select(DISTRICT)

building_permits <- building_permits %>%
                      filter(permittype %in% c("RESIDENTIAL BUILDING", "BP_ADDITION", "BP_NEWCNST"))

3 Select and Engineer Features

This study leverages open data sources, including permit counts, council district boundaries, racial mix, median income, and housing cost burden, to holistically understand what drives development pressure. Generally, data are collected at the block group or parcel level and aggregated up to the council district to capture both local and citywide trends.

Dataset | Source | Geo Level
Construction Permits | Philadelphia Dept. of Licenses & Inspections | Parcel
Zoning Base Map | Planning Commission | Parcel
Zoning Overlays | Planning Commission | Parcel
Demographic and Socioeconomic Data | U.S. Census Bureau ACS 5-Year | Block Group
Council District Boundaries and Leadership | City of Philadelphia | Parcel

3.1 Permits

First, permit data from 2012 to 2023 from the Philadelphia Department of Licenses and Inspections are critical to the study. We filter only for new construction permits granted for residential projects. In the future, filtering for full and substantial renovations could add more nuance to what constitutes development pressure.

Show the code
tm <- tmap_theme(tm_shape(permits_bg %>% filter(!year %in% c(2012, 2024))) +
        tm_polygons(col = "permits_count", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_facets(along = "year") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE),
  "New Construction Permits per Year\nPhiladelphia, PA")

suppressMessages(
tmap_animation(tm, "assets/permits_animation.gif", delay = 50)
)

Philadelphia Building Permits, 2013 - 2023

The spike in new construction permits in 2021 is reasonably attributed to the expiration of a tax abatement program for developers.

When assessing new construction permit counts by council district, a few districts issued the bulk of new permits during the 2021 peak. Hover over the lines to see more about the volume of permits and which districts issued them.

Show the code
perms_x_dist <- st_join(building_permits, council_dists)

perms_x_dist_sum <- perms_x_dist %>%
                  st_drop_geometry() %>%
                  group_by(DISTRICT, year) %>%
                  summarize(permits_count = n())

perms_x_dist_mean = perms_x_dist_sum %>%
                      group_by(year) %>%
                      summarize(permits_count = mean(permits_count)) %>%
                      mutate(DISTRICT = "Average")

perms_x_dist_sum <- bind_rows(perms_x_dist_sum, perms_x_dist_mean) %>%
                        mutate(color = ifelse(DISTRICT != "Average", 0, 1))

ggplotly(
ggplot(perms_x_dist_sum %>% filter(year > 2013, year < 2024), aes(x = year, y = permits_count, color = as.character(color), group = interaction(DISTRICT, color))) +
  geom_line(lwd = 0.7) +
  labs(title = "Permits per Year by Council District",
       y = "Total Permits") +
  # facet_wrap(~DISTRICT) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c(palette[5], palette[1]))
)

3.2 Feature Engineering by Time and Space

New construction exhibits sizable spatial and temporal autocorrelation. In other words, there is a strong relationship between the number of permits in a given block group and the number of permits in neighboring block groups; as well as between the number of permits issued in a block group in a given year and the number of permits issued in that same block group in the previous year. To account for these relationships, we engineer new features, including both space and time lags. We note that all of these engineered features have strong correlation coefficients with our dependent variable, permits_count, and p-values indicating that these relationships are statistically significant.
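As a minimal illustration of how the time-lag features are built (the spatial lags are computed analogously with sfdep::st_lag over contiguity neighbors, as in the hotspot code later on), here is a toy sketch with hypothetical block groups and permit counts; the real features come from permits_bg:

```r
library(dplyr)

# Toy panel (hypothetical values) standing in for the real block-group data
toy <- tibble(
  geoid10       = rep(c("A", "B"), each = 3),
  year          = rep(2021:2023, times = 2),
  permits_count = c(4, 6, 10, 1, 2, 3)
)

# Time lag: the previous year's permit count within the same block group
toy_lagged <- toy %>%
  group_by(geoid10) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(lag_1_year = lag(permits_count)) %>%
  ungroup()

# For 2023, block group A's lag_1_year is its 2022 count (6);
# the first year of each group has no prior year, so its lag is NA.
```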

Show the code
permits_bg_long <- permits_bg %>%
                    filter(!year %in% c(2024)) %>%
                    st_drop_geometry() %>%
                    pivot_longer(
                      cols = c(starts_with("lag"), dist_to_2022),
                      names_to = "Variable",
                      values_to = "Value"
                    )


ggscatter(permits_bg_long, x = "permits_count", y = "Value", facet.by = "Variable",
   add = "reg.line",
   add.params = list(color = palette[3], fill = palette[5]),
   conf.int = TRUE
   ) + stat_cor(method = "pearson", p.accuracy = 0.001, r.accuracy = 0.01)

3.3 Socioeconomics

Racial mix (white vs. non-white), median income, and housing cost burden are socioeconomic factors that often play an outsized role in affordability in cities like Philadelphia, which has a pervasive and persistent history of housing discrimination and systemic disinvestment. These data are all pulled from the US Census Bureau’s American Community Survey 5-Year estimates.

Spatially, it is clear that non-white communities earn lower median incomes and experience higher rates of extreme rent burden (a household spending more than 35% of income on gross rent).

Show the code
med_inc <- tmap_theme(tm_shape(permits_bg %>% filter(year == 2022)) +
        tm_polygons(col = "med_inc", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Med. Inc. ($)") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE),
  "Median Income")
  
race <- tmap_theme(tm_shape(permits_bg %>% filter(year == 2022)) +
        tm_polygons(col = "percent_nonwhite", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Nonwhite (%)") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE),
  "Race")
  
rent_burd <- tmap_theme(tm_shape(permits_bg %>% filter(year == 2022)) +
        tm_polygons(col = "ext_rent_burden", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Rent Burden (%)") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE),
  "Extreme Rent Burden")
  
tmap_arrange(med_inc, race, rent_burd)

Considering the strong spatial relationship between socioeconomics and certain areas of Philadelphia, we will be sure to investigate our model’s generalizability against race and income.

4 Build Predictive Models

“All the complaints about City zoning regulations really boil down to the fact that City Council has suppressed infill housing or restricted multi-family uses, which has served to push average housing costs higher.” - Jon Geeting, Philly 3.0 Engagement Director

SmartZoning® seeks to predict where permits are most likely to be filed as a measure to predict urban growth. As discussed, predicting growth is fraught because growth is influenced by political forces rather than by plans published by the city’s Planning Commission. Comprehensive plans, typically set on ten-year timelines, tend to become mere suggestions, ultimately subject to the prerogatives of city council members rather than serving as steadfast guides for smart growth. With these dynamics in mind, SmartZoning’s prediction model accounts for socioeconomics, council district, and time-space lag.

4.1 Tests for Correlation

The goal is to select variables that most significantly correlate with permit count for inclusion in the predictive model. Correlation is a type of association test. For example, are permit counts more closely associated with population or with median income? Do racial mix and rent burden offer redundant insight? These are the subtle but important distinctions we aim to tease out.

4.1.1 Correlation Coefficients

Show the code
corr_vars <- c("total_pop",
               "med_inc",
               "percent_nonwhite",
               "percent_renters",
               "rent_burden",
               "ext_rent_burden")

corr_dat <- permits_bg %>% select(all_of(corr_vars), permits_count) %>% select(where(is.numeric)) %>% st_drop_geometry() %>% unique() %>% na.omit()

corr <- round(cor(corr_dat), 2)
p.mat <- cor_pmat(corr_dat)

ggcorrplot(corr, p.mat = p.mat, hc.order = TRUE,
    type = "full", insig = "blank", lab = TRUE, colors = c(palette[2], "white", palette[3])) +
  annotate(
  geom = "rect",
  xmin = .5, xmax = 7.5, ymin = 4.5, ymax = 5.5,
  fill = "transparent", color = "red", alpha = 0.5
)

4.1.2 VIF

To ensure that our predictive model does not suffer from multicollinearity (multiple variables telling the same story about permit counts), we use the variance inflation factor (VIF) test. The table below lists each variable's VIF score. Variables with a score over 5 are considered potentially multicollinear, and those over 10 certainly need to be flagged. Generally, the council district and zoning overlays such as historic districts may be conflicting.
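As a quick illustration of what the test measures (separate from the model code below): the VIF of a predictor is 1/(1 − R²) from regressing that predictor on all the others, so a near-collinear variable produces a large score. A base-R sketch on synthetic data:

```r
# VIF by hand on synthetic data: regress one predictor on the rest;
# VIF = 1 / (1 - R^2). Variables here are illustrative, not from the study.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(100)

r2     <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1 / (1 - r2)  # well above the flag thresholds of 5 and 10
```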

Based on VIF, we drop:

hist_dist_na (33.12), district8 (32.90), district4 (32.16), district9 (30.83), district5 (29.39), district3 (29.09), district2 (28.58), district1 (27.20), district7 (26.74), hist_dist_historic_street_paving_thematic_district (26.74), district6 (19.91), overlay_fne (15.60), overlay_ne (11.18), overlay_nis (7.72), overlay_ndo (6.87), overlay_fdo (6.57), overlay_edo (5.60), and overlay_vdo (5.40).

Show the code
ols <- lm(permits_count ~ ., data = permits_bg %>% filter(year < 2024) %>% select(-c(mapname, geoid10, year)) %>% st_drop_geometry())
vif(ols) %>%
  data.frame() %>%
  clean_names() %>%
  select(-df) %>%
  arrange(desc(gvif)) %>%
  kablerize()
Variable GVIF GVIF^(1/(2*Df))
district 1175997.928011 2.173926
hist_dist_na 33.120644 5.755054
hist_dist_historic_street_paving_thematic_district 26.735386 5.170627
overlay_fne 15.601603 3.949886
overlay_ne 11.179327 3.343550
overlay_nis 7.717070 2.777961
overlay_ndo 6.867713 2.620632
overlay_fdo 6.574022 2.563985
overlay_edo 5.595256 2.365429
overlay_vdo 5.400210 2.323835
dist_to_transit 4.378395 2.092462
lag_spat_4_years 3.946301 1.986530
lag_spat_2_years 3.767694 1.941055
lag_spat_5_years 3.722626 1.929411
lag_spat_6_years 3.689699 1.920859
lag_spat_3_years 3.620857 1.902855
lag_spat_7_years 3.581870 1.892583
lag_spat_8_years 3.530726 1.879023
percent_nonwhite 3.147111 1.774010
lag_spat_1_year 2.943444 1.715647
overlay_ctr 2.938630 1.714243
med_inc 2.826465 1.681210
lag_spat_9_years 2.645426 1.626477
hist_dist_rittenhouse_fitler_residential 2.310139 1.519914
overlay_ahc 2.264476 1.504818
lag_4_years 2.260224 1.503404
overlay_min 2.184365 1.477960
hist_dist_spring_garden 2.178088 1.475835
hist_dist_diamond_street 2.126102 1.458116
lag_5_years 2.040912 1.428605
lag_2_years 2.012640 1.418676
overlay_ncp 1.990799 1.410957
hist_dist_girard_estate 1.933372 1.390457
overlay_wwo 1.931780 1.389885
hist_dist_overbrook_farms_historic_district 1.929129 1.388931
lag_6_years 1.896103 1.376991
lag_7_years 1.878892 1.370727
lag_8_years 1.873276 1.368677
overlay_other 1.865672 1.365896
hist_dist_old_city 1.839962 1.356452
lag_3_years 1.836923 1.355331
overlay_eco 1.800232 1.341727
overlay_nca 1.717870 1.310675
hist_dist_awbury_arboretum 1.690225 1.300086
hist_dist_park_mall_temple_universitys_campus 1.683083 1.297337
lag_1_year 1.680436 1.296316
dist_to_2022 1.654760 1.286375
lag_9_years 1.646576 1.283190
overlay_wst 1.616137 1.271274
overlay_nbo 1.611118 1.269298
percent_renters 1.602037 1.265716
overlay_tso 1.462620 1.209388
overlay_ima 1.453865 1.205763
ext_rent_burden 1.444448 1.201852
overlay_yod 1.381172 1.175233
overlay_hhc 1.379881 1.174683
overlay_nco 1.379048 1.174329
hist_dist_parkside 1.375646 1.172880
hist_dist_society_hill 1.348453 1.161229
overlay_tod 1.345925 1.160140
overlay_drc 1.313695 1.146165
overlay_gao 1.307361 1.143399
overlay_wwa 1.299857 1.140113
overlay_ued 1.299239 1.139842
overlay_na 1.287432 1.134651
hist_dist_tudor_east_falls 1.285346 1.133731
overlay_eod 1.283286 1.132822
rent_burden 1.268730 1.126379
overlay_cdo 1.266165 1.125240
hist_dist_league_island_park_aka_f_d_r_park 1.239856 1.113488
hist_dist_manayunk_main_street_historic_district 1.232682 1.110262
total_pop 1.215123 1.102326
overlay_nho 1.194565 1.092961
overlay_cgc 1.189653 1.090712
hist_dist_greenbelt_knoll 1.184422 1.088311
overlay_snm 1.181435 1.086938
overlay_cao 1.179384 1.085995
overlay_ame 1.156756 1.075526
overlay_ahp 1.149727 1.072253
overlay_stm 1.137920 1.066734
overlay_smh 1.135161 1.065439
overlay_env 1.035448 1.017570
overlay_wah 1.031113 1.015437
hist_dist_420_row 1.025286 1.012564
hist_dist_east_logan_street 1.022054 1.010967


Notably, permit count does not correlate particularly strongly with any of our selected variables. This might suggest that permits are evenly distributed throughout the city. However, as we can see below, there are few block groups with more than 50 permits. This indicates that permits are granted block by block across all districts: SmartZoning is applicable to most Philadelphia neighborhoods, not just a select few.

Show the code
ggplot(building_permits %>% filter(!year %in% c(2024)), aes(x = as.factor(year))) +
  geom_bar(fill = palette[1], color = NA, alpha = 0.7) +
  labs(title = "Permits per Year",
       y = "Count") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        aspect.ratio = .75)

Show the code
ggplot(permits_bg %>% st_drop_geometry %>% filter(!year %in% c(2024)), aes(x = permits_count)) +
  geom_histogram(fill = palette[1], color = NA, alpha = 0.7) +
  labs(title = "Permits per Block Group per Year",
       subtitle = "Log-Transformed",
       y = "Count") +
  scale_x_log10() +
  facet_wrap(~year) +
  theme_minimal() +
  theme(axis.title.x = element_blank())

4.2 Examine Spatial Patterns

To identify spatial clusters, or hotspots, in geographic data, we performed a Local Moran’s I test. It assesses the degree of spatial autocorrelation: the extent to which permit counts in a block group tend to be similar to those in neighboring block groups. We used a p-value of 0.1 as our hotspot threshold.

Show the code
lisa <- permits_bg %>% 
  filter(year == 2023) %>%
  mutate(nb = st_contiguity(geometry),
                         wt = st_weights(nb),
                         permits_lag = st_lag(permits_count, nb, wt),
          moran = local_moran(permits_count, nb, wt)) %>% 
  tidyr::unnest(moran) %>% 
  mutate(pysal = ifelse(p_folded_sim <= 0.1, as.character(pysal), NA),
         hotspot = case_when(
           pysal == "High-High" ~ "Yes",
           TRUE ~ "No"
         ))

# 
# palette <- c("High-High" = "#B20016", 
#              "Low-Low" = "#1C4769", 
#              "Low-High" = "#24975E", 
#              "High-Low" = "#EACA97")

morans_i <- tmap_theme(tm_shape(lisa) +
  tm_polygons(col = "ii", border.alpha = 0, style = "jenks", palette = mono_5_green, title = "Moran's I"),
  "Local Moran's I")

p_value <- tmap_theme(tm_shape(lisa) +
  tm_polygons(col = "p_ii", border.alpha = 0, style = "jenks", palette = mono_5_green, title = "P-Value"),
  "Moran's I P-Value")

sig_hotspots <- tmap_theme(tm_shape(lisa) +
  tm_polygons(col = "hotspot", border.alpha = 0, style = "cat", palette = c(mono_5_green[1], mono_5_green[5]), textNA = "Not a Hotspot", title = "Hotspot?"),
  "New Construction Hotspots")

tmap_arrange(morans_i, p_value, sig_hotspots, ncol = 3)

A future extension could apply emerging hotspot analysis (EHSA) to classify how these clusters evolve over time; the sketch below remains commented out because it was not completed for this prototype.

Show the code
# stc <- as_spacetime(permits_bg %>% select(permits_count, geoid10, year) %>% na.omit(),
#                  .loc_col = "geoid10",
#                  .time_col = "year")
# 
# # conduct EHSA
# ehsa <- emerging_hotspot_analysis(
#   x = stc,
#   .var = "permits_count",
#   k = 1,
#   nsim = 5
# )
# 
# count(ehsa, classification)

4.3 Compare Models

Note our workflow: we train, test, and then validate. The first models below are based on data through 2022, we then validate on 2023, and finally predict 2024 at the end.

There are various regression models available, each with its own assumptions, strengths, and weaknesses. We compared Ordinary Least Squares, Poisson, and Random Forest. This comparative study allowed us to consider each model’s accuracy, whether it overfit, its generalizability, and its computational efficiency.

The Poisson model proved unviable because it overvalued outliers, so it is not detailed below.

4.3.1 OLS

OLS (ordinary least squares) is a method for exploring relationships between a dependent variable and one or more explanatory variables. It considers the strength and direction of these relationships and the goodness of model fit. Our model incorporates three engineered groups of features: space lag, time lag, and distance to 2022. We include this last variable because of the Philadelphia tax abatement policy, discussed earlier, that led to a significant increase in residential development in the years immediately before 2022. We used OLS as a baseline model to compare to Poisson and Random Forest. Given how tightly aligned the observed and predicted permit counts are, we tested dozens of variable combinations to rule out overfitting. We are confident that our variables are generalizable and do not overfit.

Show the code
suppressMessages(
ggplot(ols_preds, aes(x = permits_count, y = ols_preds)) +
  geom_point() +
  labs(title = "Predicted vs. Actual Permits: OLS",
       subtitle = "2022 Data",
       x = "Actual Permits",
       y = "Predicted Permits") +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  theme_minimal()
)

Show the code
ggplot(ols_preds, aes(x = abs_error)) +
  geom_histogram(fill = palette[3], color = NA, alpha = 0.7) +
  labs(title = "Distribution of Absolute Error per Block Group",
       subtitle = "OLS, 2022") +
  theme_minimal()

Show the code
ols_mae <- paste0("MAE: ", round(mean(ols_preds$abs_error, na.rm = TRUE), 2))

Our OLS model exhibits a mean absolute error (MAE) of 2.66, decent performance for a model of this simplicity. However, its efficacy diminishes notably in the high-development areas where accuracy matters most. Consequently, we aim to enhance predictive capacity by incorporating more pertinent variables and a more sophisticated modeling approach.
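For reference, MAE is simply the average absolute difference between predicted and observed counts. A toy base-R computation (values are illustrative, not from the model):

```r
# Mean absolute error on toy permit counts (illustrative values)
actual    <- c(0, 2, 5, 10)
predicted <- c(1, 2, 3, 14)

mae <- mean(abs(predicted - actual))  # (1 + 0 + 2 + 4) / 4 = 1.75
```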

Show the code
ols_preds_map <- tmap_theme(tm_shape(ols_preds) +
        tm_polygons(col = "ols_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Predicted Permits: OLS")

ols_error_map <- tmap_theme(tm_shape(ols_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Absolute Error") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Absolute Error: OLS")

tmap_arrange(ols_preds_map, ols_error_map)

Spatially, the model’s absolute error concentrates in the same block groups where development clusters, precisely the areas where we most need it to succeed.

4.3.2 Random Forest

OLS and Random Forest represent different modeling paradigms. OLS is a linear regression model suited to capturing linear relationships, while Random Forest is an ensemble method capable of capturing non-linear patterns and offering greater flexibility across data scenarios. Because Random Forest is generally less sensitive to multicollinearity (each tree considers a subset of features, and predictions are averaged) and tends to mitigate the effect of outliers, we decided it was worth investigating as an alternative model.
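The fitted predictions themselves are loaded from disk above; as a hedged sketch of how such a model might be trained with the randomForest package (the data frame and column names here are synthetic stand-ins, not the project's actual training frame):

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in for the block-group training frame
n   <- 300
toy <- data.frame(
  lag_1_year      = rpois(n, 3),        # last year's permits
  lag_spat_1_year = runif(n, 0, 5),     # neighbors' lagged mean
  med_inc         = rnorm(n, 50000, 15000)
)
toy$permits_count <- toy$lag_1_year + 0.5 * toy$lag_spat_1_year + rnorm(n)

# Train a regression forest and inspect out-of-bag (OOB) error
rf      <- randomForest(permits_count ~ ., data = toy, ntree = 500, importance = TRUE)
oob_mae <- mean(abs(rf$predicted - toy$permits_count))  # rf$predicted holds OOB predictions
```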

Compared to the OLS model, the relationship between predicted and actual permits is examined below.

Show the code
ggplot(rf_test_preds, aes(x = abs_error)) +
  geom_histogram(fill = palette[3], alpha = 0.7, color = NA) +
  labs(title = "Distribution of Absolute Error per Block Group",
       subtitle = "Random Forest, 2022") +
  theme_minimal()

Show the code
suppressMessages(
ggplot(rf_test_preds, aes(x = permits_count, y = rf_test_preds)) +
  geom_point() +
  labs(title = "Predicted vs. Actual Permits: RF",
       subtitle = "2022 Data",
       x = "Actual Permits",
       y = "Predicted Permits") +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  theme_minimal()
)

Show the code
rf_test_mae <- paste0("MAE: ", round(mean(rf_test_preds$abs_error, na.rm = TRUE), 2))

Compared to the OLS model, the Random Forest model has a similar error distribution; however, it exhibits an MAE of…

Show the code
test_preds_map <- tmap_theme(tm_shape(rf_test_preds) +
        tm_polygons(col = "rf_test_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Predicted Permits: RF Test")

test_error_map <- tmap_theme(tm_shape(rf_test_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Absolute Error") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Absolute Error: RF Test") 

tmap_arrange(test_preds_map, test_error_map)

5 Model Validation

Given Random Forest’s favorable results and attributes for our study compared to OLS, we proceed to train, test, and validate using Random Forest.

We decided to split our training and testing data at 2022 in an effort to balance permitting activity before and after the tax abatement policy.

[code block here]

We train and test on data through 2022, which we use for model tuning and feature engineering.
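In outline, the split is temporal rather than random, so each held-out year is predicted only from earlier information. A toy base-R sketch (`panel` is a hypothetical stand-in for the block-group data):

```r
# Hypothetical panel with one row per block group per year
panel <- data.frame(
  geoid10 = rep(c("A", "B"), times = 4),
  year    = rep(2020:2023, each = 2)
)

train    <- subset(panel, year <= 2021)  # model tuning / feature engineering
test     <- subset(panel, year == 2022)  # held-out test year
validate <- subset(panel, year == 2023)  # final validation year
```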

Having settled on our model features and tuning, we now validate on 2023 data.

Show the code
val_preds_map <- tmap_theme(tm_shape(rf_val_preds) +
        tm_polygons(col = "rf_val_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Predicted Permits: RF Validate")

val_error_map <- tmap_theme(tm_shape(rf_val_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Absolute Error") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Absolute Error: RF Validate")

tmap_arrange(val_preds_map, val_error_map)

Show the code
ggplot(rf_val_preds, aes(x = abs_error)) +
  geom_histogram(fill = palette[3], alpha = 0.7, color = NA) +
  labs(title = "Distribution of Absolute Error per Block Group",
       subtitle = "Random Forest, 2023") +
  theme_minimal()

Show the code
suppressMessages(
ggplot(rf_val_preds, aes(x = permits_count, y = rf_val_preds)) +
  geom_point() +
  labs(title = "Predicted vs. Actual Permits: RF",
       subtitle = "2023 Data",
       x = "Actual Permits",
       y = "Predicted Permits") +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  theme_minimal()
)

Show the code
rf_val_mae <- paste0("MAE: ", round(mean(rf_val_preds$abs_error, na.rm = TRUE), 2))

Validating on 2023 data, we return an MAE of 2.19.

6 Discussion

6.1 Accuracy

Predominantly, our model overpredicts, which is preferable to underpredicting for our purpose: overprediction flags more parcels for proactive upzoning and thus facilitates new development.

Show the code
nbins <- as.integer(sqrt(nrow(rf_val_preds)))
vline <- mean(rf_val_preds$abs_error, na.rm = TRUE)

ggplot(rf_val_preds, aes(x = abs_error)) +
  geom_histogram(fill = palette[3], alpha = 0.7, color = NA, bins = nbins) +
  geom_vline(aes(xintercept = vline)) +
  theme_minimal()

6.2 Generalizability

The boxplot, categorizing block groups by racial composition, indicates that the random forest model generalizes effectively, showing consistent and relatively low absolute errors for both majority non-white and majority white categories. The similarity in error distributions suggests that the model’s predictive performance remains robust across diverse racial compositions.

Show the code
rf_val_preds <- rf_val_preds %>%
                      mutate(race_comp = case_when(
                        percent_nonwhite >= .50 ~ "Majority Non-White",
                        TRUE ~ "Majority White"
                      ))

ggplot(rf_val_preds, aes(y = abs_error, color = race_comp)) +
  geom_boxplot(fill = NA) +
  scale_color_manual(values = c(mono_5_orange[5], mono_5_orange[3])) +
  theme_minimal()

Our analysis reveals that the error is not correlated with affordability and demonstrates a downward trend in conjunction with the percentage of the nonwhite population. This observed pattern may be attributed to the likelihood that majority-minority neighborhoods experience a comparatively lower volume of overall development, thereby diminishing the absolute magnitude of error, despite potential proportional increases. Additionally, there is a slight increase in error with the total population, aligning with the intuitive expectation that higher population figures correspond to more extensive development activities.

Show the code
rf_val_preds_long <- rf_val_preds %>%
  pivot_longer(cols = c(rent_burden, percent_nonwhite, total_pop, med_inc),
               names_to = "variable", values_to = "value") %>%
  mutate(variable = case_when(
    variable == "med_inc" ~ "Median Income ($)",
    variable == "percent_nonwhite" ~ "Nonwhite (%)",
    variable == "rent_burden" ~ "Rent Burden (%)",
    TRUE ~ "Total Pop."
  ))

ggplot(rf_val_preds_long, aes(x = value, y = abs_error)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = palette[3]) +
  facet_wrap(~ variable, scales = "free_x") +
  labs(title = "Generalizability of Absolute Error",
       x = "Value",
       y = "Absolute Error") +
  theme_minimal()

How does this generalize across council districts?

Show the code
suppressMessages(
  ggplot(rf_val_preds, aes(x = reorder(district, abs_error, FUN = mean), y = abs_error)) +
    geom_boxplot(fill = NA, color = palette[3]) +
    labs(title = "Absolute Error by Council District",
         y = "Absolute Error",
         x = "Council District") +
    theme_minimal()
)

7 Assessing Upzoning Pressure

We can identify conflict between projected development and current zoning.

Below, we examine parcels zoned industrial or single-family residential in areas that our model flags as high development risk for 2023:

Show the code
filtered_zoning <- zoning %>%
                     filter(str_detect(CODE, "RS") | str_detect(CODE, "I"),
                            CODE != "I2",
                            !str_detect(CODE, "SP")) %>%
                     st_join(., rf_val_preds %>% select(rf_val_preds))



zoning_map <- tmap_theme(tm_shape(filtered_zoning) +
        tm_polygons(col = "CODE", border.alpha = 0, colorNA = "lightgrey", title = "Zoning Code") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Mismatched Zoning")
  
mismatch <- tmap_theme(tm_shape(filtered_zoning) +
        tm_polygons(col = "rf_val_preds", border.alpha = 0, colorNA = "lightgrey", palette = mono_5_orange, style = "fisher", title = "Predicted New Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Development Pressure")

tmap_arrange(zoning_map, mismatch)

We can attach block-group-level development predictions to these parcels and then visualize them by highest need.

Show the code
tmap_mode('view')

filtered_zoning %>%
  filter(rf_val_preds > 10) %>%
tm_shape() +
        tm_polygons(col = "CODE", border.alpha = 0, colorNA = "lightgrey",
                    popup.vars = c('rf_val_preds', 'CODE')) +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE)

Furthermore, we can identify properties with high potential for assemblage, which suggests the ability to accommodate high-density, multi-unit housing.

Show the code
nbs <- filtered_zoning %>% 
  mutate(nb = st_contiguity(geometry))

# Create edge list while handling cases with no neighbors
edge_list <- tibble::tibble(id = 1:length(nbs$nb), nbs = nbs$nb) %>% 
  tidyr::unnest(nbs) %>% 
  filter(nbs != 0)

# Create a graph with a node for each row in filtered_zoning
g <- make_empty_graph(n = nrow(filtered_zoning))
V(g)$name <- as.character(1:nrow(filtered_zoning))

# Add edges if they exist (define `edges` first so later code can reference it
# even when there are no neighbors at all)
edges <- matrix(numeric(0), ncol = 2)
if (nrow(edge_list) > 0) {
  edges <- as.matrix(edge_list)
  g <- add_edges(g, c(t(edges)))
}

# Calculate the number of contiguous neighbors, handling nodes without neighbors
n_contiguous <- sapply(V(g)$name, function(node) {
  if (node %in% edges) {
    length(neighborhood(g, order = 1, nodes = as.numeric(node))[[1]])
  } else {
    1  # Nodes without neighbors count as 1 (themselves)
  }
})

filtered_zoning <- filtered_zoning %>%
                    mutate(n_contig = n_contiguous)

filtered_zoning %>%
  st_drop_geometry() %>%
  select(rf_val_preds,
         n_contig,
         OBJECTID,
         CODE) %>%
  filter(rf_val_preds > 10,
         n_contig > 2) %>%
  arrange(desc(rf_val_preds)) %>%
  kablerize(caption = "Poorly-Zoned Properties with High Development Risk")
Poorly-Zoned Properties with High Development Risk
rf_val_preds n_contig OBJECTID CODE
868 27.94353 3 1615 ICMX
1548 27.94353 3 2736 IRMX
1587 27.94353 3 2804 IRMX
3420 27.94353 3 6405 RSA5
4667 27.94353 3 9661 RSA5
9169 27.94353 4 20073 ICMX
7517 22.04183 3 16717 RSA5
1768 21.73393 3 3128 IRMX
3640 21.73393 3 6901 ICMX
4957 21.68943 3 10410 ICMX
4958 21.68943 3 10411 RSA5
4959 21.68943 3 10412 ICMX
5245 21.68943 3 11160 RSA5
3934 17.81977 3 7646 ICMX
12326 17.81977 4 25776 RSA5
13578 16.39950 3 27869 IRMX
4460 16.21347 3 9093 RSA5
7726 16.00803 3 17168 ICMX
5088 14.95410 3 10759 IRMX
4512 14.36920 5 9243 IRMX
6014 14.36920 6 13057 ICMX
3041 14.17370 3 5568 ICMX
9842 14.17370 3 21369 RSA5
9843 14.17370 3 21370 ICMX
9845 14.17370 3 21372 RSA5
6645 12.52903 3 14648 ICMX
7280 12.52903 3 16179 RSA5
9912 12.52903 3 21527 ICMX
7833 11.57457 3 17408 RSA5
3957 11.13063 3 7704 IRMX
2138 11.07510 4 3744 IRMX
8143 11.04757 3 18031 RSD3
8656 11.04757 3 19076 RSA3
9409 11.04757 4 20534 RSA2
10175 11.04757 3 22002 RSD1
12605 11.04757 3 26247 RSD1
4146 10.92453 3 8265 IRMX
5108 10.92453 4 10795 IRMX
1536 10.63863 4 2715 ICMX
2422 10.63863 5 4284 IRMX
2941 10.63863 4 5351 RSA5
10490.1 10.63863 3 22527 I3
10810 10.63863 5 23106 ICMX
11135.1 10.63863 8 23678 I3

8 2024 Predictions

Show the code
tmap_mode('plot')

tmap_theme(tm_shape(rf_proj_preds) +
        tm_polygons(col = "rf_proj_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey", title = "Predicted New Permits") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE),
  "Projected New Development, 2024")

9 Web Application

10 Next Steps

11 Appendices